fix(gardener/classifiers): raise DIFF_CAP and filter noise#339

Merged
serenakeyitan merged 2 commits into main from fix/classifier-diff-cap-and-noise-filter
Apr 24, 2026

@serenakeyitan
Contributor

Summary

  • Raise DIFF_CAP from 20KB → 200KB in both anthropic.ts and claude-cli.ts — 20KB is ~6% of Haiku 4.5's 200K window and truncates typical feature PRs at ~300-400 lines.
  • Raise DIGEST_BUDGET_BYTES from 30KB → 100KB in tree-digest.ts.
  • Extract DIFF_NOISE_PATTERNS from sync.ts into a new shared engine/classifiers/diff-filter.ts. Both classifiers now strip lockfiles, dist/build/out/coverage/node_modules/__pycache__ hunks, and minified/map/snap artifacts BEFORE applying the byte cap, so the cap bounds real code instead of noise.
  • Skip the entire ## Diff section when filtering leaves nothing (e.g. lockfile-only PRs now get a clean prompt instead of an empty code fence).
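
The ordering described in the bullets above (filter first, cap second, omit the section when nothing survives) can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the pattern list is a guess at what DIFF_NOISE_PATTERNS might contain, `buildDiffSection` is a hypothetical name, and the cap is applied to string length as a stand-in for bytes.

```typescript
// Illustrative patterns only; the real DIFF_NOISE_PATTERNS list is not shown in this PR.
const DIFF_NOISE_PATTERNS: readonly RegExp[] = [
  /(^|\/)(package-lock\.json|pnpm-lock\.yaml|yarn\.lock)$/, // lockfiles
  /(^|\/)(dist|build|out|coverage|node_modules|__pycache__)\//, // generated directories
  /\.min\.(js|css)$/, // minified artifacts
  /\.(map|snap)$/, // source maps and snapshots
];

const DIFF_CAP = 200_000;

function isDiffNoise(path: string): boolean {
  return DIFF_NOISE_PATTERNS.some((re) => re.test(path));
}

// Drop whole per-file sections whose target path looks like noise.
function filterDiffNoise(diff: string): string {
  return diff
    .split(/(?=^diff --git )/m)
    .filter((section) => {
      const header = section.split("\n", 1)[0] ?? "";
      const match = /^diff --git .* b\/(.+)$/.exec(header);
      // Fail open: a section whose header cannot be parsed is kept.
      if (match === null) return true;
      return !isDiffNoise(match[1] ?? "");
    })
    .join("");
}

// Hypothetical helper: filter, then cap, then render (or skip) the section.
function buildDiffSection(rawDiff: string): string {
  const filtered = filterDiffNoise(rawDiff);
  // All-noise PRs get no Diff section at all, rather than an empty fence.
  if (filtered.trim() === "") return "";
  const fence = "`".repeat(3);
  const capped = filtered.slice(0, DIFF_CAP);
  return `## Diff\n\n${fence}diff\n${capped}\n${fence}\n`;
}
```

Because the cap is applied only after filtering, a regenerated lockfile can no longer push real code past the truncation point.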

Total prompt input rises from ~55KB to ~310KB (~38% of Haiku's 200K window), leaving headroom for prompt-caching stability, tokenizer variance on mixed Chinese/code content, and Anthropic's soft ~180-190K prompt-length threshold.

Refs #338.

Test plan

  • pnpm typecheck clean
  • pnpm test — 1203 passed, 0 failed, 51 skipped (no regressions)
  • New tests/gardener/gardener-diff-filter.test.ts — 11 cases covering lockfile drop, dist/ drop, minified artifacts, all-noise PRs (empty output), fail-open on malformed hunks, and real-code-only preservation
  • Manual: run first-tree gardener comment --pr N --repo o/r on a lockfile-heavy PR locally and confirm the verdict is grounded in real code, not lock diff

Both classifiers were capping the diff at 20KB — roughly 6% of Haiku
4.5's 200K context window — which truncated typical feature PRs at
~300-400 lines and let lockfile regeneration eat the entire budget.

- DIFF_CAP: 20_000 → 200_000 (both anthropic.ts and claude-cli.ts)
- DIGEST_BUDGET_BYTES: 30_000 → 100_000 (tree-digest.ts)
- New shared diff-filter.ts extracting DIFF_NOISE_PATTERNS from
  sync.ts. Both classifiers now strip lockfiles, dist/build/out/
  coverage/node_modules/__pycache__ hunks, and minified/map/snap
  artifacts BEFORE applying the byte cap, so the cap bounds real
  code instead of noise.
- Skip the whole Diff section when filtering leaves nothing behind.

Total input rises from ~55KB to ~310KB (~38% of Haiku's window),
leaving headroom for prompt-caching stability, tokenizer variance,
and Anthropic's soft ~180-190K prompt-length threshold.

Tests:
- New tests/gardener/gardener-diff-filter.test.ts (11 cases)
- pnpm typecheck clean
- pnpm test: 1203 passed, 0 failed
@yuezengwu added the breeze:wip label (breeze is actively working on it) on Apr 24, 2026

@yuezengwu left a comment

Reviewed the diff and ran the new tests locally (11/11 passing, typecheck clean). The fix is well-scoped: filtering noise before the byte cap is the right ordering, the ~38% of Haiku's 200K window for total prompt input leaves sane headroom for prompt-caching + tokenizer variance, and the new filterDiffNoise tests cover the important shapes (lockfiles, dist hunks, all-noise → empty, fail-open on malformed headers, real-code preservation). Guarding the ## Diff section so an all-noise PR gets no empty code fence is a nice touch.

One thing that does not match the PR description, worth a small follow-up:

  • The description says "Extract DIFF_NOISE_PATTERNS from sync.ts into a new shared engine/classifiers/diff-filter.ts", but src/products/gardener/engine/sync.ts (lines 678–686) still defines its own private DIFF_NOISE_PATTERNS and isDiffNoise. The two lists are currently identical, but there are now two sources of truth to keep in sync. formatPrDiffForPrompt operates on PrFileChange[] (file-level) while the new helper operates on raw unified-diff text, so they can't share the whole function — but sync.ts could still import { DIFF_NOISE_PATTERNS, isDiffNoise } from "./classifiers/diff-filter.js" and drop its local copy. Not a blocker, but worth closing the loop so the next pattern tweak doesn't have to be made in two files.
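
A rough sketch of what the suggested follow-up could look like: the file-level caller reuses the shared predicate instead of a private copy. The `PrFileChange` shape, the two patterns, and the `dropNoiseFiles` name are assumptions for illustration; none of them are shown in this PR.

```typescript
// Stand-in for the shared module. In the repo this would instead be:
//   import { DIFF_NOISE_PATTERNS, isDiffNoise } from "./classifiers/diff-filter.js";
// The two patterns here are illustrative, not the real list.
const DIFF_NOISE_PATTERNS: readonly RegExp[] = [
  /(^|\/)pnpm-lock\.yaml$/,
  /(^|\/)dist\//,
];

function isDiffNoise(path: string): boolean {
  return DIFF_NOISE_PATTERNS.some((re) => re.test(path));
}

// Hypothetical shape; the real PrFileChange type is not shown in this PR.
interface PrFileChange {
  path: string;
  patch?: string;
}

// File-level filtering (the role formatPrDiffForPrompt plays in sync.ts),
// driven by the same predicate the raw-diff helper uses.
function dropNoiseFiles(files: PrFileChange[]): PrFileChange[] {
  return files.filter((file) => !isDiffNoise(file.path));
}
```

With one exported predicate, a new noise pattern would only need to be added in a single place, even though the two callers consume different input shapes.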

Minor/optional observations:

  • The b/(.+)$ header regex will misbehave on filenames containing b/ (e.g. a path like foo b/bar.ts). Extremely unlikely in real repos; fail-open means the worst case is "noise slips through," so not worth fixing unless you want belt-and-suspenders.
  • The split on /(?=^diff --git )/m assumes that literal string never appears at the start of a line inside a patch body. Fine for gh pr diff output in practice; just noting for anyone who runs into a diff-of-a-diff edge case later.
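
Both edge cases can be reproduced in isolation. The sketch below assumes the header match looks like `/ b\/(.+)$/`, which may differ from the regex actually merged; the function names are hypothetical.

```typescript
// Assumed form of the header match described above; the real regex may differ.
function pathFromHeader(header: string): string | null {
  const match = / b\/(.+)$/.exec(header);
  return match === null ? null : (match[1] ?? null);
}

// Same split expression quoted in the review.
function splitSections(diff: string): string[] {
  return diff.split(/(?=^diff --git )/m);
}

// Ordinary path: the capture is the target file.
const ordinary = pathFromHeader("diff --git a/src/x.ts b/src/x.ts");
// "src/x.ts"

// Path containing " b/": the leftmost match wins, so the capture is wrong.
const oddPath = pathFromHeader("diff --git a/foo b/bar.ts b/foo b/bar.ts");
// "bar.ts b/foo b/bar.ts", not "foo b/bar.ts"

// A hunk line that carries its +/-/space prefix does not trigger a split.
const prefixed = splitSections(
  "diff --git a/a.patch b/a.patch\n+diff --git a/x b/x\ndiff --git a/b.ts b/b.ts\n+y\n",
);
// 2 sections, one per file

// Only an unprefixed occurrence at column 0 is misread as a new file section.
const unprefixed = splitSections(
  "diff --git a/a.txt b/a.txt\ndiff --git pasted at column 0\n",
);
// 2 sections, although only one file header is real
```

Under these assumptions the mis-captured path simply fails to match any noise pattern, which is consistent with the fail-open behaviour noted above.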

Approving — the duplication note is the only thing I'd actually want addressed, and a follow-up PR is fine if you prefer to keep this one tight.

This reply was drafted by breeze, an autonomous agent running on behalf of the account owner.

@yuezengwu added the breeze:done label (breeze has finished handling it) and removed the breeze:wip label (breeze is actively working on it) on Apr 24, 2026
@serenakeyitan merged commit 5b81556 into main on Apr 24, 2026
1 of 2 checks passed
serenakeyitan added a commit that referenced this pull request Apr 24, 2026
Bumps version to 0.3.2. Includes:

- #339 fix(gardener/classifiers): raise DIFF_CAP and filter noise (refs #338)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
@bingran-you deleted the fix/classifier-diff-cap-and-noise-filter branch on April 24, 2026 05:33